DOCUMENT RESUME

ED 424 236                                    TM 027 362

AUTHOR          Glas, Cees A. W.
TITLE           Detection of Differential Item Functioning Using Lagrange
                Multiplier Tests. Research Report 96-02.
INSTITUTION     Twente Univ., Enschede (Netherlands). Faculty of Educational
                Science and Technology.
PUB DATE        1996-10-00
NOTE            44p.
AVAILABLE FROM  Faculty of Educational Science and Technology, University of
                Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands.
PUB TYPE        Reports - Evaluative (142)
EDRS PRICE      MF01/PC02 Plus Postage.
DESCRIPTORS     Foreign Countries; *Item Bias; Item Response Theory; Scores;
                Secondary Education; Secondary School Students; Simulation;
                *Test Items
IDENTIFIERS     Item Bias Detection; *Lagrange Multiplier Tests;
                Netherlands; One Parameter Model; Partial Credit Model;
                Rasch Model; Two Parameter Model



ABSTRACT 



In this paper it is shown that differential item functioning 
can be evaluated using the Lagrange multiplier test or C. R. Rao's efficient
score test. The test is presented in the framework of a number of item 
response theory (IRT) models such as the Rasch model, the one-parameter 
logistic model, the two-parameter logistic model, the generalized partial 
credit model, and the nominal response model. However, the paradigm for 
detection of differential item functioning presented here applies to other 
IRT models. The proposed method is based on a test statistic with a known 
asymptotic distribution. Two examples are given, one using simulated data and 
one using real data from 1,000 boys and 1,000 girls taking a Dutch secondary 
examination. (Contains 6 tables and 44 references.) (Author/SLD) 






Detection of Differential Item Functioning
using Lagrange Multiplier Tests

Research Report 96-02






Cees A.W. Glas 





Detection of Differential Item Functioning 
using Lagrange Multiplier Tests 



Detection of Differential Item Functioning using Lagrange Multiplier Tests, Cees 
A. W. Glas - Enschede: University of Twente, Faculty of Educational Science and
Technology, December 1996. - 40 pages. 



Differential Item Functioning 



Abstract 

In the present paper it is shown that differential item functioning can be evaluated 
using the Lagrange multiplier test or Rao’s efficient score test. The test is 
presented in the framework of a number of IRT models such as the Rasch model, 
the OPLM, the 2-parameter logistic model, the generalized partial credit model and 
the nominal response model. However, the paradigm for detection of differential 
item functioning presented here also applies to other IRT models. Two examples 
are given, one using simulated data and one using real data. 

Key words: Item response theory, model fit, DIF, Rasch model, OPLM, 2-parameter
logistic model, generalized partial credit model, nominal response model, Lagrange
multiplier test, Rao’s efficient score test.



Introduction

When a new test is constructed, it is important to find empirical evidence that 
contributes to the construct validity of the test (AERA, APA & NCME, 1985). Part
of this process may be to show that the test fits a unidimensional item response 
theory (IRT) model, which means that the observed responses can be attributed to 
item and person parameters that are related to some unidimensional latent 
dimension. Construct validity is supported if the construct to be measured is also 
unidimensional and if the ordering of item difficulties imposed by the construct is 
reflected in the ordering of item parameters on the latent scale. Further, if it can be 
shown that the latent ability is unidimensional, a meaningful unidimensional
variable for measuring the underlying construct can be created, either a minimal 
sufficient statistic or some other function of the observed responses, and the 
respondent can be assigned a value on the latent ability scale. So the IRT model 
validates the scoring rule of the test. Construct validity implies that the construct to 
be measured is the same for all respondents of the population the test is aimed at. 
This is where the problem of differential item functioning (DIF) or item bias arises. 
For reasons of semantic clarity, many authors prefer the term "DIF" to the
older term "item bias" (see, for instance, Angoff, 1993 or Cole, 1993); the
present paper follows this practice. Studies of DIF deal with the question of
how item scores are affected by external variables that do not belong to the 
construct to be measured. Usually, the external variable imposes a division into a 
small number of sub-populations, where a sub-population refers to a set of 
persons that have the same value on the external variable. If the external variable 
is dichotomous, one usually speaks of the reference population, say the majority 
group or an advantaged group, and the focal population, say the minority or a
disadvantaged group. In DIF studies, the null-hypothesis is that the external
variable does not moderate the effect of ability on the item scores. So the 
responses to a dichotomous item are subject to DIF if, conditional on ability level, 
the probability of a correct response differs over the samples from the various sub- 
populations (Mellenbergh, 1982, 1983). The generalization to polytomous items is 
straightforward. The responses to a polytomous item are subject to DIF if the set 
of probabilities of scoring in the various response categories of the item, 
conditional on ability, differs between the samples from different sub-populations. 
Another, equivalent definition of DIF is that the expected scores on the item, 
conditional on ability, are different for the sub-populations under consideration 
(Chang & Mazzeo, 1994).

The essential problem in DIF studies is whether the response behavior of the 
samples of all sub-populations can be properly described by an IRT model. An
additional problem is that the possible presence of DIF will influence the parameter 
estimates of all items, and this may confound model fitting. In the example section 
of this paper it will be shown that detection of DIF can be accomplished by an 
iterative process of model fitting, testing for DIF and modeling the responses to 
affected items, until a fitting model for all items and all samples of respondents is 
found. 

Several techniques for detecting DIF have been proposed. Most of them are 
based on evaluating differences in response probabilities between groups, 
conditional on some measure of ability. The most generally used technique is 
based on the Mantel-Haenszel statistic (Holland & Thayer, 1988), others are based 
on log-linear models (Kok, Mellenbergh & van der Flier, 1985), on IRT models 
(Hambleton & Rogers, 1989), or on log-linear IRT models (Kelderman, 1989). In 
the Mantel-Haenszel, log-linear and log-linear IRT approaches, the difficulty level
of the item is evaluated conditionally on the respondents’ unweighted sum scores.
However, adopting the assumption that the unweighted sum score is a sufficient
statistic for ability (together with some technical assumptions, which will seldom be 
inappropriate) necessarily leads to the adoption of the Rasch model (Fischer, 
1974, 1993, 1995). However, with the exception of the log-linear IRT approach, the 
validity of the Rasch model is rarely explicitly tested. Therefore, Glas and Verhelst 
(1995) suggested a procedure consisting of two steps: 

(1) searching for an IRT model for fitting the data of the sample from the reference 
population, and, as far as possible, the sample from the focal population; 

(2) evaluating the differences in response probabilities between the two samples in 
homogeneous ability groups. 

In this paper, an alternative approach is investigated that has a strong 
resemblance to the above method. In the first step, Glas and Verhelst (1995) use 
a generalized version of the Rasch model where discrimination indices are imputed 
for dealing with differences in discrimination between the items. This model, known 
as the one parameter logistic model (OPLM), will be returned to below. These 
authors propose an iterative process of adjusting the discrimination indices using 
so-called generalized Pearson statistics, until an acceptable model fit is achieved. 
Evaluating the differences in response probabilities between the samples from the 
reference and focal population in homogeneous ability groups is also done using 
generalized Pearson statistics. The alternative approach of the present paper is 
not only applicable in the framework of the Rasch model and the OPLM, it can 
also be used in the context of the two-parameter logistic model and the nominal 
response model. These last two models are more flexible than the former models, 
but the tests for evaluating the fit to these models are less sophisticated, in fact, 
the asymptotic distribution of the statistics for these tests is unknown (see, for
instance, Mislevy & Bock, 1990). On the other hand, the generalized Pearson tests
for the Rasch model and the OPLM completely rely on the existence of sufficient 
statistics (see Glas & Verhelst, 1995), so these tests cannot be used for 
performing the second step of the above approach for the two-parameter logistic 
and nominal response model. Therefore, in the present paper it will be shown that 
the second step can be performed using Lagrange multiplier (LM) tests. 

The remainder of this paper is organized as follows: (1) the relevant IRT 
models will be discussed, (2) an estimation procedure will be described, (3) the LM 
tests will be presented, and (4) two examples will be given, one using simulated 
data and one using real data. 



Choosing an IRT model 

In IRT models, the influence of items and persons on the observed responses is
modelled by different sets of parameters. Since DIF is defined as the occurrence
of differences in expected scores conditional on ability, IRT modelling seems 
especially fit for dealing with this problem. However, first the question must be 
answered which IRT models are appropriate in this context. Before considering 
some significant models for studying DIF, the following definitions must be 
introduced. Consider items where the possible responses can be coded by the
integers $0, 1, 2, \ldots, m_i$. Let item $i$ have $m_i + 1$ response categories, indexed
$h = 0, 1, \ldots, m_i$. Notice that dichotomous items are the special case where
$m_i = 1$. The response to an item will be represented by a vector
$(x_{i0}, \ldots, x_{ih}, \ldots, x_{im_i})$, where $x_{ih}$ is a realization of the random variable $X_{ih}$,
defined by

$$X_{ih} = \begin{cases} 1 & \text{if a response is given in category } h, \\ 0 & \text{if this is not the case.} \end{cases} \tag{1}$$



In this section, two classes of models will be considered. The first class comprises
exponential family IRT models, such as the unidimensional polytomous Rasch model
(UPRM) by Rasch (1960, 1961), the partial credit model (PCM) by Masters (1982),
the one-parameter logistic model (OPLM) by Verhelst and Glas (1995) and the
generalized PCM (GPCM) by Wilson and Masters (1993). The second class
comprises generalizations of the first class of models outside the exponential
family, such as the two-parameter logistic model (2-PL) by Birnbaum (1968) and
the nominal response model by Bock (1972). The motivation for making this
distinction is that there are many statistical testing procedures based on statistics
with known (asymptotic) distributions for the first class of models and hardly any
such procedures for the latter class of models; this point will be returned to
below.

In the framework of polytomous items, Rasch (1960, 1961, see also, Andersen, 
1972, 1973b, 1977 and Fischer, 1974) has introduced several exponential family 
IRT models. In the model most suited for ability measurement, the UPRM, the 
probability of scoring in category $h$ of item $i$ is given by

$$\Pr(X_{ih} = 1 \mid \theta_n, \beta_i) = \frac{\exp(h\theta_n - \beta_{ih})}{1 + \sum_{k=1}^{m_i} \exp(k\theta_n - \beta_{ik})}, \tag{2}$$

where $\theta_n$ is the unidimensional ability parameter of person $n$, and
$\beta_{ih}$, $h = 1, \ldots, m_i$, are the parameters of item $i$. For $m_i = 1$, equation (2) defines
the item response function of the well-known Rasch model for dichotomous items.
One of the reasons for considering this model is that it can be derived from a set
of assumptions which will often apply in the context of ability measurement.
Andersen (1977) has shown that the UPRM can be derived from the assumption
that $R_n = \sum_{i,h} h X_{nih}$ is a minimal sufficient statistic for a unidimensional ability
parameter $\theta$, local stochastic independence and some technical assumptions.
Masters (1982) develops a completely equivalent model from an entirely different
perspective. Masters’ version, the PCM, can be derived from the assumption that
every category $h$, $h > 0$, can be seen as a step that is either passed or failed.
The final score on the item is determined by the number of steps that the
respondent has successfully taken. Further, it is assumed that the probability of
scoring in category $h$, rather than in category $h - 1$, is described by a Rasch
model for a dichotomous item with item parameter $\eta_{ih}$. Glas and Verhelst (1989)
have pointed out that the PCM is a reparametrization of the UPRM, that is, the
parameters of the UPRM are obtained by the reparametrization
$\beta_{ih} = \sum_{g=1}^{h} \eta_{ig}$, $h = 1, \ldots, m_i$.
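As an illustration that is not part of the original report, the category probabilities of (2) and the reparametrization from PCM step parameters to UPRM parameters can be sketched in a few lines of Python. The function names and the numerical values are hypothetical and chosen only for the example.

```python
import math

def pcm_to_uprm(eta):
    """Cumulative sums beta_h = eta_1 + ... + eta_h of the PCM step parameters."""
    beta, total = [], 0.0
    for step in eta:
        total += step
        beta.append(total)
    return beta

def uprm_probs(theta, beta):
    """Category probabilities of equation (2) for one item.

    beta holds beta_1..beta_m; category 0 has numerator exp(0) = 1.
    """
    numerators = [1.0] + [math.exp((h + 1) * theta - b) for h, b in enumerate(beta)]
    denom = sum(numerators)
    return [num / denom for num in numerators]

# Hypothetical item with three steps, evaluated at theta = 0.5.
probs = uprm_probs(0.5, pcm_to_uprm([-0.5, 0.3, 1.0]))
# With a single step the model reduces to the dichotomous Rasch model.
rasch = uprm_probs(0.5, [0.2])
```

Under this sketch the ratio of the probabilities of two adjacent categories depends only on the corresponding step parameter, which is the defining property of the PCM described above.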

One of the attractive features of the UPRM is the possibility of using a 
conditional maximum likelihood method (CML) for obtaining consistent estimates of 
the item parameters (see Fischer, 1974, Molenaar, 1995). By conditioning on the 
minimal sufficient statistics $R_n$, a likelihood function is obtained that does not
depend on the person parameters. This has the important advantage that 
computation of CML estimates does not need any assumption concerning the 
distribution of ability in the population. Further, these consistent estimates can, in 
principle, be obtained using any arbitrary sample of persons where the model 
holds. The less attractive feature of the model is that the possible form of the item
response curve is rather restricted; for instance, for the dichotomous case the item
response curves must be parallel in the sense that they are shifted along the latent
continuum. Fortunately, many statistical tools are available for evaluating the fit of
the Rasch model. The assumption that the unweighted sum score is a minimal
sufficient statistic for the person parameter and the assumption concerning the
form of the item response curves are the focus of Martin-Löf’s (1973) T-test, the
$R_1$-test (Glas, 1988, Glas & Verhelst, 1989), the $U_i$-test (Molenaar, 1983) and the
$S_i$- and M-tests (Verhelst & Glas, 1995, Verhelst, Glas & Verstralen, 1995). The
property that the item parameters can be consistently estimated on every subgroup
of the population is tested by Andersen’s likelihood ratio test (Andersen, 1973a)
and the Fischer-Scheiblechner test (Fischer, 1974). Finally, the assumptions of
unidimensionality and local stochastic independence are the focus of the likelihood
ratio test of Martin-Löf (1973, 1974) and the $R_2$-test of Glas (1988).

The combination of the axiomatic foundation of the model and the tradition in 
social research and educational measurement of working with unweighted sum 
scores makes the model an attractive starting point for statistical analyses.
However, the restrictive character of the model will often obstruct model fit. There 
are several aspects of the Rasch model that may lead to rejection of the model. 
These violations can be accounted for by defining specific generalizations of the 
Rasch model. In this paper, the focus will be on models where the assumption of 
the form of the item response curves is relaxed. This can be done by introducing 
discrimination indices or discrimination parameters $\alpha_{ih}$, $h = 1, \ldots, m_i$, so that
equation (2) generalizes to

$$\psi_{ih}(\theta_n) = \Pr(X_{ih} = 1 \mid \theta_n, \alpha_i, \beta_i) = \frac{\exp(\alpha_{ih}\theta_n - \beta_{ih})}{1 + \sum_{k=1}^{m_i} \exp(\alpha_{ik}\theta_n - \beta_{ik})}. \tag{3}$$

If the discrimination indices are viewed as known constants, this model can be
derived from the assumption that $R_n = \sum_{i,h} \alpha_{ih} X_{nih}$ is a sufficient statistic for
ability, local independence, and some technical assumptions (Andersen, 1977). In
the framework of known discrimination indices, Verhelst and Glas (1995) have 
developed a CML estimation procedure and a procedure for evaluating model fit, 
for the so-called OPLM, where the item categories are assumed to have score
weights $\alpha_{ih} = h\alpha_i$. Recently, Glas (1997) has generalized this procedure to the
more general GPCM by Wilson and Masters (1993), where item categories are
given scoring weights $\alpha_{ih}$.
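The role of known score weights can be made concrete with a small Python sketch, which is an illustration added here and not taken from the report. It evaluates model (3) for one item and computes the weighted score $R_n$ under OPLM-type weights; all function names and numbers are hypothetical.

```python
import math

def category_probs(theta, alpha, beta):
    """Equation (3): alpha[h-1] and beta[h-1] belong to category h = 1..m."""
    numerators = [1.0] + [math.exp(a * theta - b) for a, b in zip(alpha, beta)]
    denom = sum(numerators)
    return [num / denom for num in numerators]

def oplm_weights(a_i, m):
    """OPLM score weights alpha_ih = h * a_i for h = 1..m."""
    return [h * a_i for h in range(1, m + 1)]

def weighted_score(responses, alphas):
    """R_n: sum over items of the weight of the observed category (0 for category 0)."""
    return sum(alpha[x - 1] for x, alpha in zip(responses, alphas) if x > 0)

# Three hypothetical items with discrimination indices 1, 2 and 3.
alphas = [oplm_weights(1, 2), oplm_weights(2, 2), oplm_weights(3, 1)]
score = weighted_score([2, 1, 1], alphas)  # 2*1 + 1*2 + 1*3
```

Treating the weights as known constants is what keeps $R_n$ a sufficient statistic; once they are estimated, as in the 2-PL, this sketch no longer describes a statistic that can be conditioned on.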

The discrimination indices can also be treated as unknown item parameters to
be estimated; in the framework of dichotomous items this approach is known as
the two-parameter logistic model (2-PL) by Birnbaum (1968). The nominal
response model by Bock (1972) can be viewed as a generalization of the 2-PL to 
polytomous items. There are several considerations with respect to the choice 
between the two approaches. The OPLM and GPCM allow for CML estimation and 
have theoretically well-founded tools for testing model fit, in fact, most of the 
procedures mentioned above can easily be generalized to model (3) (Verhelst & 
Glas, 1995, Glas, 1997). On the other hand, the nominal response model is more 
flexible with respect to possible item response curves. This flexibility is bought at 
the expense of needing an MML estimation procedure for obtaining consistent 
estimates of the item parameters. This introduces assumptions with respect to the 
distribution of ability, which, of course, introduce another source of possible model 
violations that needs to be accounted for. However, attempting to generalize the 
complete catalogue of tests of model fit for exponential family IRT to non- 
exponential family IRT is far beyond the scope of the present paper; here only an 
alternative for the DIF tests of exponential family IRT will be studied.



Estimation



In the present section, the well-known theory of MML estimation for IRT models 
will be re-iterated. In this presentation the formalism of Glas (1992) will be used, 
which, as will become apparent in the sequel, is especially suited for introducing 
LM tests for DIF. Consider the case of two sub-populations. A background variable 
will be defined by 



$$y_n = \begin{cases} 1 & \text{if person } n \text{ belongs to the focal population,} \\ 0 & \text{if person } n \text{ belongs to the reference population.} \end{cases} \tag{4}$$



The absence of DIF entails that respondents of equal ability of different sub- 
populations have the same expected item scores. This, of course, does not mean 
that the expected item scores in the different sub-populations are the same, 
because it may well be the case that the ability distributions of the sub-populations 
are different. Let $g(\theta_n \mid \lambda_{y(n)})$ be the density of the ability distribution of
sub-population $y$, with parameters $\lambda_{y(n)}$, where $y(n) = y_n$ is the index of the
sub-population of person $n$. Further, if $\xi' = (\alpha', \beta', \lambda')$ is the vector of all item and
population parameters, the log-likelihood can be written as

$$\ln L(\xi; X) = \sum_n \ln \Pr(x_n; \xi). \tag{5}$$

To derive the MML estimation equations, it proves convenient to introduce the
vector of derivatives

$$b_n(\xi) = \frac{\partial}{\partial \xi} \ln \Pr(x_n, \theta_n; \xi) = \frac{\partial}{\partial \xi} \left[ \ln \Pr(x_n \mid \theta_n, \alpha, \beta) + \ln g(\theta_n \mid \lambda_{y(n)}) \right]. \tag{6}$$

Glas (1992) adopts an identity due to Louis (1982) to write the first order
derivatives of (5) with respect to $\xi$ as

$$\frac{\partial}{\partial \xi} \ln L(\xi; X) = \sum_n E(b_n(\xi) \mid x_n; \xi). \tag{7}$$

This identity greatly simplifies the derivation of the likelihood equations. For
instance, using the short-hand notation $\psi_{nih} = \psi_{ih}(\theta_n)$, it can be easily verified
that

$$b_n(\alpha_{ih}) = \theta_n (x_{nih} - \psi_{nih}) \tag{8}$$

and

$$b_n(\beta_{ih}) = \psi_{nih} - x_{nih}, \tag{9}$$

so the likelihood equations are given by

$$\sum_n x_{nih} E(\theta_n \mid x_n; \xi) = \sum_n E(\theta_n \psi_{nih} \mid x_n; \xi) \tag{10}$$

and

$$\sum_n x_{nih} = \sum_n E(\psi_{nih} \mid x_n; \xi). \tag{11}$$

The choice of a distribution of ability is not essential to the theory presented here;
the test for DIF applies both to the parametric MML framework (see Bock &
Aitkin, 1982) and to the non-parametric MML framework (see De Leeuw & Verhelst, 1986,
Follmann, 1988). As an example of the parametric context, one might assume that
the ability distribution is normal with parameters $\mu_y$ and $\sigma_y$. Then

$$b_n(\mu_{y(n)}) = (\theta_n - \mu_{y(n)}) \, \sigma_{y(n)}^{-2} \tag{12}$$

and

$$b_n(\sigma_{y(n)}) = -\sigma_{y(n)}^{-1} + (\theta_n - \mu_{y(n)})^2 \, \sigma_{y(n)}^{-3}, \tag{13}$$

so the likelihood equations are

$$\mu_y = \frac{1}{N_y} \sum_{n \mid y} E(\theta_n \mid x_n; \xi) \tag{14}$$

and

$$\sigma_y^2 = \frac{1}{N_y} \sum_{n \mid y} E(\theta_n^2 \mid x_n; \xi) - \mu_y^2, \tag{15}$$

where the right-hand summations are over the respondents in the sample from
sub-population $y$, and $N_y$ is the number of respondents in this sample. Below, this
framework will be used for introducing an LM test for DIF, but first the principle of
LM tests will be described.
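To make the likelihood equations for a normal ability distribution more tangible, the following Python sketch performs one update of the population mean and standard deviation for a single sub-population. It is an added illustration under simplifying assumptions: the items are dichotomous Rasch items with known difficulties, and the posterior expectations are approximated on an equally spaced grid rather than by the quadrature a production program would use. All names are hypothetical.

```python
import math

def rasch_prob(theta, beta):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

def posterior_moments(pattern, betas, mu, sigma, n_points=81, width=6.0):
    """Approximate E(theta | x) and E(theta^2 | x) on an equally spaced grid."""
    grid = [mu + sigma * width * (2.0 * q / (n_points - 1) - 1.0) for q in range(n_points)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * ((theta - mu) / sigma) ** 2)  # normal prior, unnormalized
        for x, beta in zip(pattern, betas):
            p = rasch_prob(theta, beta)
            w *= p if x == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    m1 = sum(w * t for w, t in zip(weights, grid)) / total
    m2 = sum(w * t * t for w, t in zip(weights, grid)) / total
    return m1, m2

def update_population(patterns, betas, mu, sigma):
    """One pass of the mean and variance equations for one sub-population."""
    moments = [posterior_moments(x, betas, mu, sigma) for x in patterns]
    n = len(patterns)
    new_mu = sum(m1 for m1, _ in moments) / n
    new_var = sum(m2 for _, m2 in moments) / n - new_mu ** 2
    return new_mu, math.sqrt(new_var)

# Two hypothetical response patterns on three items with symmetric difficulties.
mu_new, sigma_new = update_population([[1, 1, 1], [0, 0, 0]], [-1.0, 0.0, 1.0], 0.0, 1.0)
```

In a full MML procedure such updates would alternate with updates of the item parameters until convergence; the sketch only shows the population step.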








Lagrange multiplier tests 

Applications of LM tests to the framework of IRT have been described by Glas and 
Verhelst (1995). The principle of the LM test (Aitchison & Silvey, 1958), and the 
equivalent efficient-score test (Rao, 1948) can be summarized as follows. The 
arrangement of the LM test is the same as the arrangement of the likelihood-ratio 
test and the Wald test; all these three tests are used for testing a special model 
against a more general alternative. Consider a null-hypothesis about a model with
parameters $\phi_0$. This model is a special case of a general model with parameters
$\phi$. In the present case the special model is derived from the general model by
fixing one or more parameters to known constants. Let $\phi_0$ be partitioned as
$\phi_0' = (\phi_{01}', \phi_{02}') = (\phi_{01}', c')$, where $c$ is a vector of postulated constants. Let
$h(\phi)$ be the partial derivatives of the log-likelihood of the general model, so
$h(\phi) = (\partial/\partial\phi) \ln L(\phi)$. This vector of partial derivatives gauges the change of the
log-likelihood as a function of local changes in $\phi$. Let $H(\phi, \phi)$ be defined as
$-(\partial^2/\partial\phi\,\partial\phi') \ln L(\phi)$. Then the LM statistic is given by

$$LM = h(\phi_0)' H(\phi_0, \phi_0)^{-1} h(\phi_0). \tag{16}$$

If (16) is evaluated using the ML estimate of $\phi_{01}$ and the postulated values of $c$,
it has an asymptotic chi-square distribution with degrees of freedom equal to the
number of parameters fixed.
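Because the reference distribution is chi-square with a small, integer number of degrees of freedom, a significance probability can be obtained from closed-form expressions. The Python sketch below is an added illustration, not part of the report; the function name is hypothetical and only the standard library is used.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_j (x/2)^(j - 1/2) / Gamma(j + 1/2)
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= half / (j - 0.5)
        total += math.exp(-half) * term
    return total

# Significance probability of a hypothetical LM value with one fixed parameter.
p_value = chi2_sf(6.2, 1)
```

With one fixed parameter, for instance a single additional difficulty parameter, the test has one degree of freedom; fixing a difficulty and a discrimination parameter together gives two.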

An important computational aspect of the procedure is that at the point of the
ML estimates $\hat\phi_{01}$ the free parameters have a partial derivative equal to zero.
Therefore, (16) can be computed as

$$LM(c) = h(c)' W^{-1} h(c) \tag{17}$$

with

$$W = H(c, c) - H(c, \hat\phi_{01}) H(\hat\phi_{01}, \hat\phi_{01})^{-1} H(\hat\phi_{01}, c). \tag{18}$$

Notice that $H(\hat\phi_{01}, \hat\phi_{01})$ also plays a role in the Newton-Raphson procedure for
solving the estimation equations and in computation of the observed information
matrix, so its inverse will generally be available at the end of the estimation
procedure anyway. Further, if the validity of the model of the null-hypothesis is
tested against various alternative models, the computational task is relieved
because the inverse of $H(\hat\phi_{01}, \hat\phi_{01})$ is already available and the order of $W$ is
equal to the number of parameters fixed, which must be small to keep the
interpretation of the outcome tractable.
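The computation of the LM statistic from the partitioned matrix of second order derivatives can be sketched as follows. This Python fragment is an added illustration with hypothetical names and toy numbers; it uses a small Gauss-Jordan solver from scratch so that it does not depend on any numerical library.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(row_a) + list(row_b) for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        piv = m[col][col]
        m[col] = [v / piv for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lm_statistic(h_c, H_cc, H_cf, H_ff):
    """LM(c) = h(c)' W^{-1} h(c) with W = H_cc - H_cf H_ff^{-1} H_fc."""
    H_fc = [list(col) for col in zip(*H_cf)]  # transpose, since H is symmetric
    correction = matmul(H_cf, solve(H_ff, H_fc))
    k = len(H_cc)
    W = [[H_cc[i][j] - correction[i][j] for j in range(k)] for i in range(k)]
    z = solve(W, [[v] for v in h_c])
    return sum(v * row[0] for v, row in zip(h_c, z))

# Toy example: one fixed parameter (c) and two free parameters (f).
lm = lm_statistic([1.5], [[2.0]], [[1.0, 0.5]], [[4.0, 1.0], [1.0, 3.0]])
```

The matrix $W$ is the Schur complement of the block for the free parameters, which is why the result coincides with the full quadratic form of the general statistic when the derivatives of the free parameters are zero.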

The interpretation of the outcome of the test is supported by observing that the 
value of (17) depends on the magnitude of $h(c)$, that is, on the first order
derivatives with respect to the parameters $\phi_{02}$ evaluated at $c$. If the absolute
values of these derivatives are large, the fixed parameters are bound to change 
once they are set free, and the test is significant, that is, the special model is 
rejected. If the absolute values of these derivatives are small, the fixed parameters 
will probably show little change should they be set free, that is, the values at which 
these parameters are fixed in the special model are adequate and the test is not 
significant, that is, the special model is not rejected. 

The rationale of using LM tests rather than likelihood ratio tests and Wald tests 
is based on the fact that LM tests only need ML estimates of the parameters of the 
special model. In many instances, the parameters of the general model will be
quite complicated to estimate. But even if this is not the case, this procedure still
has the advantage that many alternatives can be considered without needing 
repeated estimation of all these alternatives. In the sequel it will be shown that the 
hypothesis of DIF can be tested for one item at a time. If this was done using a 
Wald or likelihood ratio test, it would require computing new estimates for every 
test. Further, DIF is just one of the many possible violations that may be of 
interest. Scanning the whole spectrum of violations of a non-exponential family IRT 
model without repeated estimation presents a promising direction for further 
research, but this is beyond the scope of the present paper. 



Lagrange Multiplier tests for DIF 



In section 3 the case of two sub-populations labeled $y = 0, 1$ was considered. As
a generalization of the model defined by (3) consider

$$\Pr(X_{ih} = 1 \mid y_n, \theta_n, \alpha_i, \beta_i, \gamma_i, \delta_i) = \frac{\exp(\alpha_{ih}\theta_n - \beta_{ih} + y_n(\gamma_{ih}\theta_n - \delta_{ih}))}{1 + \sum_{k=1}^{m_i} \exp(\alpha_{ik}\theta_n - \beta_{ik} + y_n(\gamma_{ik}\theta_n - \delta_{ik}))}. \tag{19}$$

This model implies that the responses of the reference population are properly
described by (3), but that the responses of the focal population need additional
location parameters $\delta_{ih}$, additional discrimination parameters $\gamma_{ih}$, or both. In the
dichotomous case, the first instance covers so-called uniform DIF, that is, a shift of
the item response curve for the focal population, while the latter two cases are
often labelled non-uniform DIF, that is, the item response curve for the focal
population is not only shifted, but it also intersects the item response curve of the
reference population (Mellenbergh, 1982, 1983). Application of the LM test boils
down to postulating a special model where $\gamma_{ih}$ and $\delta_{ih}$ are equal to zero and
testing against the alternative that either $\gamma_{ih}$, $h = 1, \ldots, m_i$, $\delta_{ih}$, $h = 1, \ldots, m_i$, or
both sets of parameters are non-zero.
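For dichotomous items the alternative model can be written out and simulated in a few lines. The Python sketch below is an added illustration: the function names, the seed handling through an explicit random generator, and the parameter layout are all hypothetical choices, not the procedure used in the report.

```python
import math
import random

def dif_prob(theta, y, alpha, beta, gamma=0.0, delta=0.0):
    """Equation (19) for a dichotomous item (m_i = 1); y is 1 for the focal group."""
    z = alpha * theta - beta + y * (gamma * theta - delta)
    return 1.0 / (1.0 + math.exp(-z))

def simulate_group(n, y, items, mu, sigma, rng):
    """Draw n response patterns; items is a list of (alpha, beta, gamma, delta)."""
    data = []
    for _ in range(n):
        theta = rng.gauss(mu, sigma)  # normal ability distribution of the group
        data.append([int(rng.random() < dif_prob(theta, y, *item)) for item in items])
    return data

# Hypothetical two-item test; the second item has uniform DIF (delta = 0.5).
items = [(1.0, 0.0, 0.0, 0.0), (1.0, 0.5, 0.0, 0.5)]
rng = random.Random(1)
reference = simulate_group(50, 0, items, 0.0, 1.0, rng)
focal = simulate_group(50, 1, items, -0.5, 1.0, rng)
```

Because $y_n = 0$ for the reference group, the additional parameters have no effect there; a positive $\delta$ lowers the probability of a correct response for focal respondents at every ability level, which is the uniform case.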

The rest of this section will be devoted to the derivation and the interpretation 
of the expressions for the LM statistic. As with the derivation of the estimation 
equations, also for the derivation of the matrix of second order derivatives the 
theory by Louis (1982) can be used. Using Glas (1992), it follows that the matrix of 
second order derivatives for the special model,

$$H(\xi, \xi) = -\frac{\partial^2 \ln L(\xi; X)}{\partial \xi \, \partial \xi'}, \tag{20}$$

evaluated using MML estimates, is given by

$$H(\xi, \xi) = -\sum_n \left[ E(B_n(\xi, \xi) \mid x_n; \xi) + E(b_n(\xi) b_n(\xi)' \mid x_n; \xi) \right], \tag{21}$$

where

$$B_n(\xi, \xi) = \frac{\partial^2 \ln \Pr(x_n, \theta_n; \xi)}{\partial \xi \, \partial \xi'}. \tag{22}$$

Notice that the expressions for the second of the two right-hand terms of (21) can
be directly derived from (8) and (9). The resulting expressions for some item $i$ are
listed in Table 1. The expressions for $B_n(\xi, \xi)$ involving two different items $i$ and
$j$ are all equal to zero.



Insert Table 1 about here

Inserting these structural zeros and the expressions of Table 1 into (21) gives the
expression for $H(\xi, \xi)$ as far as the free item parameters are concerned. Further,
from (6) it follows that for any population parameter $\lambda_y$, $y = 0, 1$,
$B_n(\alpha_{ih}, \lambda_y) = B_n(\beta_{ih}, \lambda_y) = 0$. Continuing the example of a normal ability



This concludes the derivation of the expressions for $H(\xi, \xi)$ for the free
parameters in $\xi$.

The fixed parameters emerge from a general model, where it is assumed that
for the focal population additional location parameters $\delta_{ih}$ and discrimination
parameters $\gamma_{ih}$ have to be postulated. Under the null-hypothesis, these additional
parameters are fixed at zero. For these fixed parameters, it can easily be shown that

$$b_n(\gamma_{ih}) = y_n \theta_n (x_{nih} - \psi_{nih}) \tag{23}$$

and

$$b_n(\delta_{ih}) = y_n (\psi_{nih} - x_{nih}), \tag{24}$$

so the entries of the vector $h(c)$ of the general LM statistic (17) are given by

$$h(\gamma_{ih}) = \sum_n y_n x_{nih} E(\theta_n \mid x_n; \xi) - \sum_n y_n E(\theta_n \psi_{nih} \mid x_n; \xi) \tag{25}$$

and

$$h(\delta_{ih}) = \sum_n y_n E(\psi_{nih} \mid x_n; \xi) - \sum_n y_n x_{nih}. \tag{26}$$

Notice that the right-hand side of (26) is the difference between the expected and
observed number of persons in the focal group scoring in category $h$ of item $i$.
So for dichotomous items the right-hand side of (26) is the difference between the
observed number correct in the focal group and its expectation computed using
parameter estimates obtained in both groups simultaneously. Since a test based
on (26) is aimed at the hypothesis that there is no specific additional difficulty $\delta_{ih}$
present, it should be sensitive to uniform DIF, that is, a shift of the item response
curve for the focal population. As a result of this shift, the observed number correct
score for item $i$ in the focal group will not be properly predicted if item parameter
estimates obtained on both groups simultaneously are used. This inconsistency
between the observed and the predicted number correct score for item $i$ in the
focal group is exactly what is reflected in the difference on the right-hand side of
(26). If this difference is too large, the entry $h(\delta_{ih})$ of $h(c)$ will be large and the
test will be significant. Also (25) is a difference between the expected and
observed number of persons in the focal group scoring in category $h$ of item $i$,
but here the individual observations and expectations are weighted with the
expectation of $\theta$ given the observed individual response pattern. Therefore
differences in the extremes of the ability range carry more weight than differences
in the middle of the ability range. This is in accordance with the fact that the
differences on the right-hand side of (25) arise when a test is derived for the
hypothesis that the slope of the regression of the responses on $\theta$ is the same for
all groups.
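The two entries of $h(c)$ discussed above can be computed for a dichotomous item with a short Python sketch. This is an added illustration under simplifying assumptions: the item parameters of the null model are taken as known, the posterior of $\theta$ is approximated on an equally spaced grid, and all names are hypothetical.

```python
import math

def item_prob(theta, alpha, beta):
    """psi for a dichotomous item under the null model (3)."""
    return 1.0 / (1.0 + math.exp(-(alpha * theta - beta)))

def posterior(pattern, items, mu, sigma, n_points=61, width=5.0):
    """Grid and normalized posterior weights of theta given one response pattern."""
    grid = [mu + sigma * width * (2.0 * q / (n_points - 1) - 1.0) for q in range(n_points)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * ((theta - mu) / sigma) ** 2)
        for x, (alpha, beta) in zip(pattern, items):
            p = item_prob(theta, alpha, beta)
            w *= p if x == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    return grid, [w / total for w in weights]

def dif_scores(focal_patterns, items, i, mu, sigma):
    """Entries h(gamma_i) and h(delta_i) of h(c) for dichotomous item i."""
    h_gamma = h_delta = 0.0
    for x in focal_patterns:  # y_n = 1 only for focal respondents
        grid, post = posterior(x, items, mu, sigma)
        alpha, beta = items[i]
        e_theta = sum(w * t for w, t in zip(post, grid))
        e_psi = sum(w * item_prob(t, alpha, beta) for w, t in zip(post, grid))
        e_theta_psi = sum(w * t * item_prob(t, alpha, beta) for w, t in zip(post, grid))
        h_gamma += x[i] * e_theta - e_theta_psi   # observed minus expected, weighted by theta
        h_delta += e_psi - x[i]                   # expected minus observed number correct
    return h_gamma, h_delta

# Hypothetical focal sample in which nobody answers the third item correctly.
items = [(1.0, -0.5), (1.0, 0.0), (1.0, 0.5)]
h_gamma, h_delta = dif_scores([[1, 1, 0], [0, 1, 0], [1, 0, 0]], items, 2, 0.0, 1.0)
```

In this toy sample the expected number correct on the third item exceeds the observed number, so the entry for the additional difficulty parameter is positive, pointing to an item that is harder for the focal group than the null model predicts.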

For computation of the LM statistic the matrix of second order derivatives with
respect to the fixed and free parameters must be evaluated. Using equation (19)
the reader can easily verify that for the fixed parameters

$$B_n(\gamma_{ih}, \gamma_{ig}) = y_n B_n(\alpha_{ih}, \alpha_{ig}), \quad B_n(\gamma_{ih}, \delta_{ig}) = y_n B_n(\alpha_{ih}, \beta_{ig}), \quad \text{and} \quad B_n(\delta_{ih}, \delta_{ig}) = y_n B_n(\beta_{ih}, \beta_{ig}).$$

In the same manner, it can also be derived that the second order derivatives with
respect to fixed and free parameters are equal to

$$B_n(\gamma_{ih}, \alpha_{ig}) = y_n B_n(\alpha_{ih}, \alpha_{ig}), \quad B_n(\gamma_{ih}, \beta_{ig}) = y_n B_n(\alpha_{ih}, \beta_{ig}),$$

$$B_n(\delta_{ih}, \alpha_{ig}) = y_n B_n(\beta_{ih}, \alpha_{ig}), \quad \text{and} \quad B_n(\delta_{ih}, \beta_{ig}) = y_n B_n(\beta_{ih}, \beta_{ig}).$$

Again, inserting these expressions into (23) gives the desired expressions for the 

elements of Hfcfc). 
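
Once the first order derivatives and the blocks of second order derivatives are
available, the statistic itself is a quadratic form. The sketch below uses the
generic form of Rao's efficient score test, LM = h' W^{-1} h with
W = H_cc - H_cl H_ll^{-1} H_lc, where c indexes the fixed and l the free
parameters. The function names are illustrative, and it is assumed here that the
blocks are supplied as positive definite, information-type matrices; the sign
conventions of the definitions given earlier in the report may differ.

```python
import math


def solve(a, b):
    """Solve a x = b (a square, b a matrix) by Gauss-Jordan elimination."""
    n = len(a)
    m = [list(ra) + list(rb) for ra, rb in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        piv = m[col][col]
        m[col] = [v / piv for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lm_statistic(h, h_cc, h_cl, h_ll):
    """LM = h' W^{-1} h with W = H_cc - H_cl H_ll^{-1} H_lc."""
    h_lc = [list(col) for col in zip(*h_cl)]          # transpose of H_cl
    correction = matmul(h_cl, solve(h_ll, h_lc))
    w = [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(h_cc, correction)]
    w_inv_h = solve(w, [[v] for v in h])
    return sum(hv * row[0] for hv, row in zip(h, w_inv_h))


def chi2_sf(x, df):
    """Upper tail probability of a chi-square variate; closed forms for df 1 and 2."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or df = 2 are handled in this sketch")
```

The degrees of freedom equal the number of fixed parameters that are tested, so a
test of one added parameter for a dichotomous item is referred to one degree of
freedom and a simultaneous test of two added parameters to two.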



Some examples 

In this section, various examples of LM tests for DIF will be presented. These 
examples must be viewed as an illustration of the technique, not as an exhaustive 
power study. The first example concerns data simulated with the Rasch model for 
dichotomous items. The second example concerns a data set that was recently 
analyzed using the OPLM, CML estimates and generalized Pearson tests (Glas & 
Verhelst, 1995). It will be re-analyzed here using MML estimates and LM tests, 
both for the OPLM and the nominal response model. 

To illustrate the possibilities of the technique, a number of simulation studies
were carried out using data simulated for a test of 10 dichotomous Rasch items.
The data for each replication consisted of 1000 response patterns for the
reference group and 1000 response patterns for the focal group. The responses of
the reference group were generated according to a Rasch model; the item
parameters used are given in the second and fourth column of Table 2. For the
focal group, the items 1 through 6 and 10 were generated using the same Rasch
model as for the reference group, but the responses for the items 7, 8 and 9 were
generated using (19); the additional discrimination parameter $\gamma_i$ and
difficulty $\delta_i$ are given in the third and fifth column of Table 2. The
response patterns in the
study were generated using normal ability distributions. To keep the illustration 
realistic, it was assumed that the means of the ability distribution of the reference 
group and the ability distribution of the focal group differed: the actual values used 
for generating the data are shown in the second column of the last four rows of 
Table 2. The remaining columns of this table give results of analyses averaged 
over 100 replications. For each replication, MML estimates and their standard
errors were computed. The means of the estimates of the item parameters are 
shown in the sixth and seventh column, the means of the estimates of the 
population parameters are shown in the last two columns of the four bottom lines 
of Table 2. In each replication, for each item three LM statistics were computed:
$LM(\gamma_i)$ to test whether $\gamma_i$ departed from zero, $LM(\delta_i)$ to
test whether $\delta_i$ departed from zero, and $LM(\gamma_i, \delta_i)$ to test
whether $\gamma_i$ and $\delta_i$ simultaneously departed from zero. The results
are given in the last nine columns of Table 2. The columns labeled
"$LM(\gamma_i)$", "$LM(\delta_i)$" and "$LM(\gamma_i, \delta_i)$" contain the
means of the test statistics, the columns labeled "Pr" contain the mean probability
levels of the statistics and the columns labeled "Nr" contain the number of times
that the test was significant at the 5%-level. From the first columns of this table
it can be seen that the responses to item 7 are subject to uniform DIF only, that
is, $\delta_i \neq 0$, item 8 is subject to non-uniform DIF only, that is,
$\gamma_i \neq 0$, and item 9 shows both uniform and non-uniform DIF, so here both
$\delta_i \neq 0$ and $\gamma_i \neq 0$. The results
show that the LM tests are indeed sensitive to the various forms of DIF imposed.
For the items 8 and 9, the mean significance probabilities of $LM(\gamma_i)$ are
below 0.022 and 0.033, respectively. Further, the test is significant at the
5%-level in 87 and 76 replications. The $LM(\delta_i)$ test for the items 7 and 9
has a probability level below 0.001 and 0.004 and the hypothesis of no uniform DIF
is rejected at the 5%-level in 100 and 97 percent of the cases. Finally, for all
three items, $LM(\gamma_i, \delta_i)$ is significant at the 5%-level in 99, 91 and
100 percent of the replications; the mean significance probabilities are below
0.003, 0.024 and 0.001, respectively. The DIF
imposed on the three items does, of course, result in some bias in the parameter 
estimates of the other items, which, in turn, results in an augmentation of the 
number of erroneously significant LM tests. However, the consequences of this 
effect must not be exaggerated: it can be seen that the mean outcome and 
probability levels of the tests for the items not affected by DIF are substantially 
different from the same indices for the items where the responses are subject to 
DIF. Therefore, it is advisable to adopt a procedure where the items with the
most extreme outcomes are handled first, either by removing them or by modelling
the responses to these items further; an example will be given below. For the
present example, removing the items with DIF resulted in rejection rates of the 
hypothesis of no DIF for the other items at the proper chance level. 
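
The data-generating design of this example can be sketched as follows. The sketch
assumes that, for dichotomous items, the model for the focal group takes the form
logit P = (1 + gamma) * theta - (beta + delta), which is one reading of (19); the
sign convention is an assumption. All parameter values below are hypothetical
placeholders, since the values of Table 2 are not reproduced here, and the
function name `simulate_group` is introduced for illustration only.

```python
import math
import random


def simulate_group(n_persons, betas, mu, sigma, gammas=None, deltas=None, seed=0):
    """Simulate 0/1 responses; gammas and deltas are added slope and difficulty."""
    rng = random.Random(seed)
    k = len(betas)
    gammas = gammas if gammas is not None else [0.0] * k
    deltas = deltas if deltas is not None else [0.0] * k
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(mu, sigma)              # ability drawn from N(mu, sigma^2)
        row = []
        for beta, gamma, delta in zip(betas, gammas, deltas):
            logit = (1.0 + gamma) * theta - (beta + delta)
            p = 1.0 / (1.0 + math.exp(-logit))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data


# Hypothetical values, NOT those of Table 2.
BETAS = [-1.0, -0.75, -0.5, -0.25, 0.0, 0.0, 0.25, 0.5, 0.75, 1.0]
GAMMAS = [0.0] * 6 + [0.0, 0.5, 0.5, 0.0]   # items 8 and 9: non-uniform DIF
DELTAS = [0.0] * 6 + [0.5, 0.0, 0.5, 0.0]   # items 7 and 9: uniform DIF

reference = simulate_group(1000, BETAS, mu=0.0, sigma=1.0, seed=1)
focal = simulate_group(1000, BETAS, mu=0.5, sigma=1.0,
                       gammas=GAMMAS, deltas=DELTAS, seed=2)
```

Repeating the two calls with different seeds gives the replications; the
difference in `mu` between the groups mirrors the remark that the ability
distributions of the two groups were not assumed to be equal.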

The second example entails a data set recently analyzed by Glas and Verhelst 
(1995) using the OPLM and generalized Pearson statistics. The objective of the 
present analysis is to investigate whether the DIF detected by these two authors 
will also be detected if LM tests are used, first in combination with the OPLM and 
then using the nominal response model. The example comprises part of an
examination of the business curriculum for the Dutch higher secondary education,
the HAVO level. The example was part of a larger study of gender based DIF in
examinations in secondary education. Since the objective, both here and in the
Glas and Verhelst (1995) paper, is to illustrate the statistical procedures rather 
than to give an account of the findings with respect to gender based DIF, no actual 
examples of items with DIF will be shown. For a detailed report of the findings one 
is referred to Bugel and Glas (1992). The analyses were carried out using a 
sample of 1000 boys and 1000 girls from the complete examination population. For 
convenience of presentation the example is limited to 10 items. The items are
open-ended questions; the number of score points that could be obtained ranged
from $m_i = 2$ to $m_i = 3$; the exact
distribution of score points over the items can be seen in the second column of
Table 3.

In the first analysis, the OPLM was used. Glas and Verhelst (1995) have fitted 
an OPLM to the data used here, the discrimination indices that proved adequate 
are shown in the third column of Table 3. These indices were also used in the 
present analyses. MML estimates were computed under the assumption of 
different normal ability distributions for the boys and the girls. The results of this 
MML estimation procedure are given in the columns marked "$\beta_{ih}$",
"Se($\beta_{ih}$)", "$\mu_y$", "Se($\mu_y$)", "$\sigma_y$" and "Se($\sigma_y$)"
and under the heading "Analysis 1". Glas and Verhelst (1995) have pointed out that
the adequacy of the chosen scoring weights can be evaluated using an LM statistic
for testing whether the value at which $\alpha_i$ is fixed is acceptable. This
test, denoted $LM(\alpha_{ih})$, was computed for
every category within an item, that is, for every category $h$ of item $i$ it was
tested whether the hypothesis $\alpha_{ih} = h\alpha_i$ had to be rejected. The
results of this test are displayed in the columns marked "$LM(\alpha_{ih})$" and
"Prob". It can be seen that the items 3 and 9 do not fit the model. However, at
this point it is unclear whether this lack of fit is due to DIF, since it might
well be the case that the chosen discrimination index was inappropriate for boys
and girls alike. Therefore, the LM statistics proposed in this paper were computed
for testing whether non-zero shift parameters $\delta_{ih}$, $h = 1,\ldots,m_i$,
had to be added for the girls. The test was performed per item for all item
category parameters simultaneously, therefore the test is labeled $LM(\delta_i)$.
The results are shown in the columns marked "$LM(\delta_i)$" and "Prob" of Table 4.
It can be seen that the test is highly significant for the items 3 and 9.
However, the test is also significant at a 5% level for the items 1 and 10.
Interestingly, these results are similar to the results of the Glas and Verhelst 
(1995) analysis: also there the items 3 and 9 were highly significant and the items 
1 and 10 moderately significant. As already noted above, the presence of DIF can 
bias the estimates of the parameters of items that are not influenced by DIF. 
Therefore, it is advisable to try to model DIF for the highly significant items
before drawing conclusions for the other items. The following additional analyses
were carried out. First, item 9 was entered into the analysis as a different item
for the boys and the girls, that is, it was assumed that the item parameters
$\beta_{ih}$ were different for these two groups. However, from computation of the
$LM(\alpha_{ih})$ statistics it had to be concluded that the scoring weights
$\alpha_i$ also differed across the two groups; this result was also encountered
in the Glas and Verhelst (1995) analysis. Changing this weight from 4 to 2
resulted in non-significant $LM(\alpha_{ih})$ tests. In this analysis, also the
$LM(\delta_i)$ statistics were computed; the results are shown under the heading
"Analysis 2" in Table 4. The $LM(\delta_i)$ statistic could not be computed for
item 9 since it was split into two so-called conceptual items. Notice that the
test for item 1 is no longer significant at the 5% level. Next, this procedure was
repeated with item 3 split up into two conceptual items and with both the items 3
and 9 split up, respectively. The results are displayed under the headings
"Analysis 3" and "Analysis 4" in Table 4. It can be seen that in the last
analysis all $LM(\delta_i)$ statistics are non-significant. In Table 3, the
parameter estimates and the $LM(\alpha_{ih})$ statistics for the last analysis are
shown. Inspection shows that also these last statistics are no longer significant
at the 5% level. So
after splitting up the items 3 and 9 into different conceptual items for the two 
groups, an OPLM could be fitted to the data. This result is consistent with the 
results of the Glas and Verhelst (1995) analyses. 
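
Splitting an item into two conceptual items amounts to a simple rearrangement of
the data matrix: the original column is replaced by one column that is observed
only for the reference group and one that is observed only for the focal group.
The sketch below shows that rearrangement under the assumption that the estimation
routine treats `None` as "not administered"; the function name `split_item` and
the use of `None` are illustrative choices, not part of the procedure described
above.

```python
def split_item(data, groups, item, focal_label):
    """Replace column `item` by two conceptual items appended at the end.

    The second-to-last column holds the score of reference-group persons,
    the last column the score of focal-group persons; the other cell is None.
    """
    out = []
    for row, group in zip(data, groups):
        rest = row[:item] + row[item + 1:]
        if group == focal_label:
            out.append(rest + [None, row[item]])
        else:
            out.append(rest + [row[item], None])
    return out


# Two persons, three polytomous items; the item at index 1 is split.
scores = [[1, 2, 0], [0, 1, 3]]
groups = ["boys", "girls"]
print(split_item(scores, groups, item=1, focal_label="girls"))
# prints [[1, 0, 2, None], [0, 3, None, 1]]
```

After the split, the two conceptual items receive their own parameters, so no LM
test for DIF can be computed for them, as noted above for item 9.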

Finally, it was investigated how the procedure would perform if the nominal
response model was used instead of the OPLM. From the previous analyses it is
already apparent that the OPLM fits the data quite well, so the nominal response
model should give results close to the previous results. In Table 5 the parameter
estimates are shown for two analyses with the same arrangement as the analyses
labeled "Analysis 1" and "Analysis 4" in Table 3. It can be seen that the
estimates of the scoring weights $\alpha_{ih}$ are in accordance with the weights
$\alpha_i$ postulated for the OPLM. Also the estimates of $\beta_{ih}$ differ
little.

In Table 6 the values of the $LM(\gamma_i, \delta_i)$ statistics are shown for four
analyses comparable to the four analyses of Table 4. The $LM(\gamma_i, \delta_i)$
statistic is used to test the simultaneous hypothesis that the parameters
$\gamma_{ih}$ and $\delta_{ih}$, $h = 1,\ldots,m_i$, are all equal to zero. It
can be seen that also in the present case the items 3 and 9 show DIF. However, 
in this case the tests for the items 4 and 10 were also significant in the first 
analysis. As with the previous analyses, this significant result vanished when the 
items 3 and 9 were split into conceptual items for boys and girls. Again, this shows 
that it is important to investigate the items one at a time, starting with the items 
that seem to show the most serious DIF, because DIF in one item may affect the 
estimates of the parameters of the other items in such a way that the LM tests 
produce spurious results. 
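
The one-at-a-time strategy recommended here can be written as a short loop. The
sketch assumes a user-supplied function `fit_and_test(split)` that re-estimates
the model with the items in `split` entered as separate conceptual items and
returns the significance probability of the LM test for every remaining item; that
function, like the name `stepwise_dif`, is hypothetical and stands in for the
estimation and testing steps described above.

```python
def stepwise_dif(fit_and_test, alpha=0.05):
    """Split the most significant item, refit, and repeat until none is significant.

    fit_and_test(split) must return a dict {item: p-value} for the items that
    have not been split. Returns the list of split items and the final p-values.
    """
    split = []
    while True:
        pvalues = fit_and_test(list(split))
        flagged = {i: p for i, p in pvalues.items()
                   if i not in split and p < alpha}
        if not flagged:
            return split, pvalues
        worst = min(flagged, key=flagged.get)   # smallest p-value first
        split.append(worst)


def toy_fit_and_test(split):
    """Toy stand-in with made-up p-values, not results from this report."""
    if not split:
        return {1: 0.03, 3: 0.0001, 9: 0.00001, 10: 0.04}
    if split == [9]:
        return {1: 0.2, 3: 0.0002, 10: 0.045}
    return {1: 0.3, 10: 0.2}


handled, final = stepwise_dif(toy_fit_and_test)
print(handled)  # prints [9, 3]
```

In the toy run the items flagged only moderately in the first pass are no longer
flagged once the two most extreme items have been handled, which is the pattern
the text warns about.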

Discussion

In the present paper a method for detection of DIF is proposed that is based on a 
test statistic with a known asymptotic distribution. In the simulated example, it
is shown that the method can not only be used to detect DIF, it can also be used to
distinguish between uniform and non-uniform DIF. The validity of the procedure is 
further supported with a real data example, where the results obtained are in 
agreement with the results obtained using the OPLM, in combination with CML 
estimates and generalized Pearson statistics. However, a choice between the two
methods is not straightforward. The LM procedure can handle a wider array of
IRT models than the procedure based on generalized Pearson statistics, which can
only be applied in the framework of exponential family IRT models. On the other
hand, the latter procedure can be embedded in a procedure where various
sources of model violations can be systematically evaluated, whereas evaluation 
methods of model fit for non-exponential family IRT are still rather unsophisticated. 
This is the more serious because estimation in non-exponential family IRT relies 
on assumptions about the ability distribution. These assumptions are an integral 
part of the model and should be tested also. In summary, there is no clear answer 
to the question which method is to be preferred. 

In the present paper the LM method for detection of DIF is worked out in detail, 
implemented and evaluated for the OPLM and the nominal response model with 
normal ability distributions. However, the procedure does not only apply to the
models discussed here, it also applies to other unidimensional IRT models, such
as for instance the models proposed by Samejima (1969, 1972), and to
multidimensional models such as the model proposed by Glas (1992). Further, the
assumption of one or more normal ability distributions can be replaced with the
assumption of the non-parametric MML method that the distribution of ability can
be represented by one or more step-functions (De Leeuw & Verhelst, 1986;
Follmann, 1988). Elaboration, implementation and evaluation of these applications
is a topic for further research.

Table 3

Parameter Estimates and Model fit for the OPLM

Table 5

Parameter Estimates for the Nominal Response Model